Unsupervised Learning: Trade&Ahead¶

Marks: 60

Context¶

The stock market has consistently proven to be a good avenue for investing and saving for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, create wealth, and provide some tax benefits. Good, steady returns compounded over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can build for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price over 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Estimated number of shares of the company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
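To make the valuation metrics concrete, here is a quick arithmetic sketch of how Earnings Per Share, P/E Ratio, and P/B Ratio relate. The numbers are illustrative only (loosely based on one row of the data), and the book value per share is an assumed figure:

```python
# Illustrative numbers only; book_value_per_share is an assumed figure.
net_income = 3_669_000_000          # dollars
shares_outstanding = 2_800_000_000  # estimated shares outstanding
current_price = 104.66              # dollars per share
book_value_per_share = 17.79        # (assets - liabilities) / shares, assumed

eps = net_income / shares_outstanding            # earnings per share
pe_ratio = current_price / eps                   # price paid per dollar of earnings
pb_ratio = current_price / book_value_per_share  # price relative to book value

print(round(eps, 2), round(pe_ratio, 2), round(pb_ratio, 2))
```

A high P/E can mean the market expects strong growth, while a P/B well above 1 means the stock trades well above the company's accounting value.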

Importing necessary libraries and data¶

In [50]:
#Libraries to help with reading data and manipulation
import numpy as np
import pandas as pd

#Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z score
from sklearn.preprocessing import StandardScaler

#to compute distances
from scipy.spatial.distance import cdist
from scipy.spatial.distance import pdist

#to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

#to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

#to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

#to perform pca
from sklearn.decomposition import PCA

#to suppress warnings
import warnings

warnings.filterwarnings("ignore")

Dataset loading¶

In [51]:
data=pd.read_csv("stock_data.csv")

Data Overview¶

  • Observations
  • Sanity checks

shape of the dataset¶

In [52]:
data.shape
Out[52]:
(340, 15)

random rows of the dataset¶

In [53]:
#viewing a random sample of the dataset
data.sample(n=10, random_state=1)
Out[53]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
102 DVN Devon Energy Corp. Energy Oil & Gas Exploration & Production 32.000000 -15.478079 2.923698 205 70 830000000 -14454000000 -35.55 4.065823e+08 93.089287 1.785616
125 FB Facebook Information Technology Internet Software & Services 104.660004 16.224320 1.320606 8 958 592000000 3669000000 1.31 2.800763e+09 79.893133 5.884467
11 AIV Apartment Investment & Mgmt Real Estate REITs 40.029999 7.578608 1.163334 15 47 21818000 248710000 1.52 1.636250e+08 26.335526 -1.269332
248 PG Procter & Gamble Consumer Staples Personal Products 79.410004 10.660538 0.806056 17 129 160383000 636056000 3.28 4.913916e+08 24.070121 -2.256747
238 OXY Occidental Petroleum Energy Oil & Gas Exploration & Production 67.610001 0.865287 1.589520 32 64 -588000000 -7829000000 -10.23 7.652981e+08 93.089287 3.345102
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 4.353535e+08 17.682214 -3.838260
112 EQT EQT Corporation Energy Oil & Gas Exploration & Production 52.130001 -21.253771 2.364883 2 201 523803000 85171000 0.56 1.520911e+08 93.089287 9.567952
147 HAL Halliburton Co. Energy Oil & Gas Equipment & Services 34.040001 -5.101751 1.966062 4 189 7786000000 -671000000 -0.79 8.493671e+08 93.089287 17.345857
89 DFS Discover Financial Services Financials Consumer Finance 53.619999 3.653584 1.159897 20 99 2288000000 2297000000 5.14 4.468872e+08 10.431906 -0.375934
173 IVZ Invesco Ltd. Financials Asset Management & Custody Banks 33.480000 7.067477 1.580839 12 67 412000000 968100000 2.26 4.283628e+08 14.814159 4.218620

creating a copy of original data¶

In [54]:
df=data.copy()
In [55]:
df.columns=[c.replace(" ","_") for c in df.columns]

checking the data types of the columns of the dataset¶

In [56]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker_Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS_Sector                   340 non-null    object 
 3   GICS_Sub_Industry             340 non-null    object 
 4   Current_Price                 340 non-null    float64
 5   Price_Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash_Ratio                    340 non-null    int64  
 9   Net_Cash_Flow                 340 non-null    int64  
 10  Net_Income                    340 non-null    int64  
 11  Earnings_Per_Share            340 non-null    float64
 12  Estimated_Shares_Outstanding  340 non-null    float64
 13  P/E_Ratio                     340 non-null    float64
 14  P/B_Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB

we can drop Ticker_Symbol, as it only uniquely identifies each stock and doesn't provide any insight for our analysis¶

In [57]:
df.drop("Ticker_Symbol", axis=1, inplace=True)

duplicate check¶

In [58]:
df.duplicated().sum()
Out[58]:
0

no duplicate values

statistical summary of the dataset¶

In [59]:
df.describe()
Out[59]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio
count 340.000000 340.000000 340.000000 340.000000 340.000000 3.400000e+02 3.400000e+02 340.000000 3.400000e+02 340.000000 340.000000
mean 80.862345 4.078194 1.525976 39.597059 70.023529 5.553762e+07 1.494385e+09 2.776662 5.770283e+08 32.612563 -1.718249
std 98.055086 12.006338 0.591798 96.547538 90.421331 1.946365e+09 3.940150e+09 6.587779 8.458496e+08 44.348731 13.966912
min 4.500000 -47.129693 0.733163 1.000000 0.000000 -1.120800e+10 -2.352800e+10 -61.200000 2.767216e+07 2.935451 -76.119077
25% 38.555000 -0.939484 1.134878 9.750000 18.000000 -1.939065e+08 3.523012e+08 1.557500 1.588482e+08 15.044653 -4.352056
50% 59.705000 4.819505 1.385593 15.000000 47.000000 2.098000e+06 7.073360e+08 2.895000 3.096751e+08 20.819876 -1.067170
75% 92.880001 10.695493 1.695549 27.000000 99.000000 1.698108e+08 1.899000e+09 4.620000 5.731175e+08 31.764755 3.917066
max 1274.949951 55.051683 4.580042 917.000000 958.000000 2.076400e+10 2.444200e+10 50.090000 6.159292e+09 528.039074 129.064585

observations¶

  • a minimum value of 0 in the cash_ratio column indicates that a company held essentially no cash or cash equivalents against its current liabilities, which is unusual and worth noting
  • the average current price for a stock is about 81 dollars
  • the price change averages about 4 percent per stock over the 13-week period
  • the average volatility (standard deviation of the stock price over 13 weeks) is about 1.5
  • the average ROE for a stock is about 40, far above the median of 15, suggesting a few companies with extreme returns on equity
  • the average cash ratio is about 70; note that this is a ratio, not a dollar amount
  • the average net cash flow for a stock is about 55,537,620 dollars
  • the average net income for a stock is about 1,494,385,000 dollars
  • the average earnings per share for a company is 2 dollars and 77 cents
  • the average estimated shares outstanding, i.e., the number of company shares held by all shareholders, is about 577 million
  • the average P/E ratio for a stock is about 33
  • the average P/B ratio is about -1.72
  • the numeric variables are measured on very different scales and need to be scaled before clustering so that no single variable dominates
  • there are no missing values in the data
  • negative values make sense, as they can represent losses, so we will not drop any negative values
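The scaling point above can be sketched with StandardScaler: z-score standardization transforms each column to mean 0 and standard deviation 1, so dollar-scale columns such as Net_Income cannot dominate the distance computations (toy numbers, not the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with wildly different scales, mimicking
# Current_Price (tens of dollars) vs Net_Income (billions of dollars).
X = np.array([[32.0, -1.4e10],
              [104.7, 3.7e9],
              [40.0, 2.5e8],
              [79.4, 6.4e8]])

# z-score: (x - column mean) / column standard deviation
X_scaled = StandardScaler().fit_transform(X)

# Every column now has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```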

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
In [60]:
#function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    
    Barplot with percentage at the top
    
    data=dataframe
    feature:dataframe column
    perc:whether to display percentages instead of count (default is false)
    n:displays the top n category levels (default is none, i.e., display all levels)
    """
    
    total=len(data[feature])
    count=data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count+1,5))
    else:
        plt.figure(figsize=(n+1,5))
        
    plt.xticks(rotation=90, fontsize=15)
    ax=sns.countplot(
    data=data,
    x=feature,
    palette="Paired",
    order=data[feature].value_counts().index[:n].sort_values(),
    )
    
    for p in ax.patches:
        if perc == True:
            label="{:.1f}%".format(
                100*p.get_height()/total
            ) #percentage of each class of the category
        else:
            label=p.get_height() #count of each level of the category
            
        x=p.get_x()+p.get_width() / 2 #width of the plot
        y=p.get_height() #height of the plot
        
        ax.annotate(
            label,
            (x,y),
            ha="center",
            va="center",
            size=12,
            xytext=(0,5),
            textcoords="offset points",
        ) #annotate the percentage

    plt.show() #show the plot
In [61]:
def histogram_boxplot(data,feature,figsize=(12,7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    
    data=dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    
    f2, (ax_box2, ax_hist2) = plt.subplots(
    nrows=2, #Number of rows of the subplot grid= 2
        sharex=True, #x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25,0.75)},
        figsize=figsize,
    ) #creating the 2 subplots
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    ) #boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
         data=data, x=feature, kde=kde, ax=ax_hist2
    ) # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    ) #add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(),color="black", linestyle="-"
    ) #add median to the histogram
In [62]:
#selecting numerical columns
num_col=df.select_dtypes(include=np.number).columns.tolist()

for item in num_col:
    histogram_boxplot(df,item)

observations¶

  • current_price has a right-skewed distribution: most stocks are priced relatively low, while a few are priced very high
  • price_change is also slightly right-skewed, indicating that most stocks saw a modest percentage increase in price over the 13-week period
  • roe is skewed to the right: most companies have moderate returns on equity, while a few show very high values
  • cash_ratio is also right-skewed, with a few companies holding large cash reserves relative to their current liabilities
  • both net_cash_flow and net_income have long tails, meaning a handful of companies post very large inflows or profits
  • p/e ratios are skewed to the right, with a few stocks trading at very high multiples of their earnings
  • p/b ratio has some high positive outliers, indicating companies whose stock is priced well above book value
  • there are outliers in several variables such as roe, cash_ratio, and net_cash_flow, but they will be kept, as they represent genuine extreme values rather than data errors
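The right-skew observations above can be verified numerically with pandas' `.skew()`. A sketch using a synthetic lognormal sample as a stand-in for a right-skewed column such as current_price:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A lognormal sample mimics a right-skewed metric: most values are small,
# with a long tail of large ones. The normal sample is roughly symmetric.
skewed = pd.Series(rng.lognormal(mean=4, sigma=0.8, size=1000))
symmetric = pd.Series(rng.normal(loc=4, scale=0.8, size=1000))

# Positive skewness indicates a long right tail; near 0, rough symmetry.
print(round(skewed.skew(), 2), round(symmetric.skew(), 2))
```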
In [63]:
labeled_barplot(df, "GICS_Sector")
  • the industrials sector has the highest number of stocks, followed by the financials and health care sectors

Bivariate Analysis¶

Let's check for correlations¶

In [64]:
plt.figure(figsize=(15,7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • net_income has the strongest positive correlation with estimated_shares_outstanding, followed by the net_income and earnings_per_share pair and the earnings_per_share and current_price pair
  • roe and earnings_per_share are strongly negatively correlated; in this data, stocks with a low roe tend to have higher earnings per share
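To rank the relationships in a heatmap like the one above without reading cells one by one, the correlation matrix can be reduced to its upper triangle and sorted. A sketch on a toy frame (the data is synthetic and the column names merely echo the dataset's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
income = rng.normal(size=n)

# Synthetic stand-in: Earnings_Per_Share is built to correlate with Net_Income.
df_toy = pd.DataFrame({
    "Net_Income": income,
    "Earnings_Per_Share": income * 0.8 + rng.normal(scale=0.3, size=n),
    "Volatility": rng.normal(size=n),
})

# Keep only the upper triangle (each pair once, no self-correlations),
# then stack into a Series and sort to find the strongest pairs.
corr = df_toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head(3))
```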
In [65]:
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
  • price_change appears roughly normally distributed
  • most of the numerical variables are skewed to the right
  • apart from the few pairs noted for the heatmap, most variables show little to no correlation with each other

Data Preprocessing¶

  • Duplicate value check
  • Missing value treatment
  • Outlier check
  • Feature engineering (if needed)
  • Any other preprocessing steps (if needed)

Checking for missing values¶

In [66]:
df.isna().sum()
Out[66]:
Security                        0
GICS_Sector                     0
GICS_Sub_Industry               0
Current_Price                   0
Price_Change                    0
Volatility                      0
ROE                             0
Cash_Ratio                      0
Net_Cash_Flow                   0
Net_Income                      0
Earnings_Per_Share              0
Estimated_Shares_Outstanding    0
P/E_Ratio                       0
P/B_Ratio                       0
dtype: int64
  • No missing values
In [67]:
#variables used for clustering
num_col
Out[67]:
['Current_Price',
 'Price_Change',
 'Volatility',
 'ROE',
 'Cash_Ratio',
 'Net_Cash_Flow',
 'Net_Income',
 'Earnings_Per_Share',
 'Estimated_Shares_Outstanding',
 'P/E_Ratio',
 'P/B_Ratio']
In [68]:
#scaling the dataset before clustering
scaler=StandardScaler()
subset=df[num_col].copy()
subset_scaled=scaler.fit_transform(subset)
In [69]:
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)

EDA¶

  • It is a good idea to explore the data once again after manipulating it.

Univariate analysis¶

In [70]:
#selecting numerical columns
num_col=df.select_dtypes(include=np.number).columns.tolist()

for item in num_col:
    histogram_boxplot(df,item)

Bivariate analysis¶

In [71]:
plt.figure(figsize=(15,7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
In [72]:
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
  • the data itself has not been changed; outliers were kept, as they represent genuine extreme values rather than data errors

K-means Clustering¶

elbow curve¶

In [73]:
clusters=range(1,9)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction=model.predict(subset_scaled_df)
    distortion=(
    sum(
        np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"),axis=1)
    )
    /subset_scaled_df.shape[0]
    )
    
    meanDistortions.append(distortion)
    
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
    
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2692367155390745
Number of Clusters: 4 	Average Distortion: 2.178151429073042
Number of Clusters: 5 	Average Distortion: 2.110720186207485
Number of Clusters: 6 	Average Distortion: 2.062297686937201
Number of Clusters: 7 	Average Distortion: 2.0289794220177395
Number of Clusters: 8 	Average Distortion: 1.984517747325883
Out[73]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
  • the average distortion decreases steadily as k increases without a sharp elbow; the rate of decrease slows noticeably beyond k = 4 to 5, so the elbow curve suggests a value of k in that range
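The elbow logic stands out more clearly on synthetic data where the true number of groups is known. A sketch using KMeans' built-in `inertia_` (within-cluster sum of squares) on four well-separated blobs (hypothetical data, not the stock set):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four clearly separated blobs, so the "true" k is 4.
X, _ = make_blobs(n_samples=300,
                  centers=[[-6, -6], [-6, 6], [6, -6], [6, 6]],
                  cluster_std=1.0, random_state=1)

# inertia_ drops sharply until k reaches the true number of clusters,
# then flattens out; that bend is the elbow.
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)

drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print(" ".join(f"{v:.0f}" for v in inertias))
```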

silhouette score¶

In [74]:
sil_score=[]
cluster_list=list(range(2,10))
for n_clusters in cluster_list:
        clusterer=KMeans(n_clusters=n_clusters)
        preds=clusterer.fit_predict((subset_scaled_df))
        # centers = clusterer.cluster_centers_
        score=silhouette_score(subset_scaled_df, preds)
        sil_score.append(score)
        print("For n clusters = {}, silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
For n clusters = 2, silhouette score is 0.43969639509980457
For n clusters = 3, silhouette score is 0.4644405674779404
For n clusters = 4, silhouette score is 0.4577225970476733
For n clusters = 5, silhouette score is 0.42436843176418354
For n clusters = 6, silhouette score is 0.3863465606304045
For n clusters = 7, silhouette score is 0.42199780823099103
For n clusters = 8, silhouette score is 0.1430038367496768
For n clusters = 9, silhouette score is 0.40205424663377165
Out[74]:
[<matplotlib.lines.Line2D at 0x2980f56e320>]
  • according to the silhouette scores, k = 3 gives the highest score (about 0.46), with k = 2 and k = 4 close behind
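For intuition on how the silhouette score picks k, here is a sketch on synthetic blobs where the true number of groups is 3 (illustrative data, not the stock set):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated groups, so the silhouette score should peak at k = 3.
X, _ = make_blobs(n_samples=300, centers=[[-8, 0], [0, 8], [8, 0]],
                  cluster_std=1.0, random_state=1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```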
In [75]:
# Finding optimal no.of clusters with silhouette coefficients
visualizer=SilhouetteVisualizer(KMeans(7,random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[75]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [76]:
# finding optimal no. of clusters with silhouette coefficients
visualizer=SilhouetteVisualizer(KMeans(6,random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[76]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [78]:
visualizer=SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[78]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [79]:
# finding optimal no of clusters with silhouette coefficients
visualizer=SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[79]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [80]:
#finding optimal no of clusters with silhouette coefficents
visualizer=SilhouetteVisualizer(KMeans(3,random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Out[80]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Selecting Final Model¶

In [81]:
kmeans=KMeans(n_clusters=5, random_state=0)
kmeans.fit(subset_scaled_df)
Out[81]:
KMeans(n_clusters=5, random_state=0)
KMeans(n_clusters=5, random_state=0)
In [82]:
#adding k-means cluster labels to the original dataframe
df["K_means_segments"]=kmeans.labels_

Cluster profiling¶

In [83]:
cluster_profile=df.groupby("K_means_segments").mean()
In [84]:
cluster_profile["count_in_each_segment"]=(df.groupby("K_means_segments")["Earnings_Per_Share"].count().values)
In [85]:
#let's display the cluster profiles
cluster_profile.style.highlight_max(color="lightgreen",axis=0)
Out[85]:
  Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio count_in_each_segment
K_means_segments                        
0 246.574304 14.284326 1.769621 26.500000 279.916667 459120250.000000 1009205541.666667 6.167917 549432140.538333 90.097512 14.081386 24
1 41.373681 -14.849938 2.596790 27.285714 64.457143 34462657.142857 -1293864285.714286 -2.459714 450100420.905143 61.563930 2.476202 35
2 48.103077 6.053507 1.163964 27.538462 77.230769 773230769.230769 14114923076.923077 3.958462 3918734987.169230 16.098039 -4.253404 13
3 72.783335 0.912232 2.015435 542.666667 34.000000 -350866666.666667 -5843677777.777778 -14.735556 372500020.988889 53.574485 -8.831054 9
4 72.768128 5.701175 1.359857 25.598456 52.216216 -913081.081081 1537660934.362934 3.719247 436114647.527683 23.473934 -3.374716 259
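The profiling pattern used here (per-segment means plus a count column) can be sketched on a small hypothetical labeled frame; the toy values and labels below are illustrative:

```python
import pandas as pd

# Hypothetical mini-version of df with cluster labels already attached.
toy = pd.DataFrame({
    "segment": [0, 0, 1, 1, 1],
    "Current_Price": [40.0, 60.0, 100.0, 120.0, 140.0],
    "ROE": [10, 20, 30, 40, 50],
})

# Mean of each numeric column per segment, plus the segment sizes.
profile = toy.groupby("segment").mean()
profile["count_in_each_segment"] = toy.groupby("segment").size()

print(profile)
```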
In [86]:
plt.figure(figsize=(15,10))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i,variable in enumerate(num_col):
    plt.subplot(4, 3, i+1)
    sns.boxplot(data=df,x="K_means_segments",y=variable)
    
plt.tight_layout(pad=2.0)
In [87]:
df.groupby("K_means_segments").mean().plot.bar(figsize=(15,6))
Out[87]:
<Axes: xlabel='K_means_segments'>
  • Cluster 0: current_price has the highest mean and volatility the lowest within the cluster; the rest of the numerical variables do not have notably high means
  • Cluster 1: volatility has the highest mean while net_cash_flow has the lowest of all the variables in the cluster; the rest do not have notably high means
  • Cluster 2: current_price has the highest mean while net_cash_flow has the lowest; the rest of the variables are not notably high in comparison
  • Cluster 3: volatility has the highest mean while net_income has the lowest; the rest of the variables are not notably high in comparison
  • Cluster 4: current_price has the highest mean while estimated_shares_outstanding has the lowest in the cluster; the rest of the variables are not notably high in comparison

Hierarchical Clustering¶

Checking Cophenetic Correlation¶

In [88]:
#list of distance metrics
distance_metrics= ["euclidean","chebyshev","mahalanobis","cityblock"]

#list of linkage methods
linkage_methods=["single","complete","average","weighted"]

high_cophenet_corr=0
high_dm_lm=[0,0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z=linkage(subset_scaled_df,metric=dm,method=lm)
        c,coph_dists=cophenet(Z, pdist(subset_scaled_df))
        print(
        "Cophenetic correlation for {} distance and {} linkage is {}.".format(
            dm.capitalize(),lm,c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr=c
            high_dm_lm[0]=dm
            high_dm_lm[1]=lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524588.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.792530720285.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180427.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
In [89]:
#printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(),high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
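As a reminder of what is being measured: the cophenetic correlation compares the original pairwise distances with the heights at which pairs are first merged in the dendrogram. A minimal sketch on synthetic points with two obvious groups, where the tree preserves the distances almost perfectly:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Two tight, well-separated groups of 2-D points (synthetic data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(5, 0.3, size=(10, 2))])

# Correlation between dendrogram merge heights and original distances;
# values near 1 mean the hierarchy represents the distances faithfully.
Z = linkage(X, metric="euclidean", method="average")
c, _ = cophenet(Z, pdist(X))
print(round(c, 3))
```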

Only euclidean distance¶

In [90]:
#list of linkage methods
linkage_methods=["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr=0
high_dm_lm=[0,0]

for lm in linkage_methods:
    Z= linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linking is {}.".format(lm,c))
    if high_cophenet_corr < c:
       high_cophenet_corr = c
       high_dm_lm[0]="euclidean"
       high_dm_lm[1]= lm
Cophenetic correlation for single linking is 0.9232271494002922.
Cophenetic correlation for complete linking is 0.7873280186580672.
Cophenetic correlation for average linking is 0.9422540609560814.
Cophenetic correlation for centroid linking is 0.9314012446828154.
Cophenetic correlation for ward linking is 0.7101180299865353.
Cophenetic correlation for weighted linking is 0.8693784298129404.
In [91]:
#printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
     )
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.

Observations

  • Across all the distance metrics and linkage methods, the highest cophenetic correlation is obtained with Euclidean distance and average linkage

Checking Dendrograms¶

In [100]:
#list of linkage methods
linkage_methods=["single","complete","average","centroid","ward","weighted"]

#lists to save results of cophenetic correlation calculation
compare_cols=["Linkage","Cophenetic Coefficient"]
compare=[]

#to create a subplot image
fig,axs=plt.subplots(len(linkage_methods),1,figsize=(15,30))


#We will enumerate through the list of linkage methods above
##For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z=linkage(subset_scaled_df, metric="euclidean",method=method)
    
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    
    coph_corr, coph_dist=cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
    f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
    (0.80,0.80),
    xycoords="axes fraction",
    )
    
    compare.append([method,coph_corr])

Observations

  • Ward linkage seems to have the most distinct clusters
In [98]:
df_cc=pd.DataFrame(compare, columns=compare_cols)
df_cc
Out[98]:
Linkage Cophenetic Coefficient
0 single 0.923227
1 complete 0.787328
2 average 0.942254
3 centroid 0.931401
4 ward 0.710118
5 weighted 0.869378
In [102]:
#list of distance metrics
distance_metrics=["cityblock","euclidean"]

#list of linkage methods
linkage_methods=["average","single"]

#to create a subplot image
fig, axs=plt.subplots(
    len(distance_metrics)*len(linkage_methods), 1, figsize=(10,30)
)

i=0
for dm in distance_metrics:
    for lm in linkage_methods:
        Z=linkage(subset_scaled_df,metric=dm,method=lm)
        
        dendrogram(Z,ax=axs[i])
        axs[i].set_title("Distance metric: {}\nLinkage: {}".format(dm.capitalize(),lm))
        
        coph_corr,coph_dist=cophenet(Z, pdist(subset_scaled_df))
        axs[i].annotate(
            f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
            (0.80,0.80),
            xycoords="axes fraction",
        )
        i += 1

Observations

  • From all these results, ward appears to be the best linkage to proceed with: despite its lower cophenetic correlation, it produces the most distinct, well-separated clusters
  • According to the dendrogram, four clusters would be appropriate
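
The dendrogram-based choice of four clusters can also be obtained by cutting the ward-linkage tree directly with scipy's fcluster. A minimal sketch, using random data as a stand-in for subset_scaled_df:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # stand-in for subset_scaled_df

Z = linkage(X, metric="euclidean", method="ward")
# cut the tree so that at most 4 flat clusters remain
labels = fcluster(Z, t=4, criterion="maxclust")
```

This produces the same partition that AgglomerativeClustering with ward linkage and n_clusters=4 would find, which is useful for sanity-checking the dendrogram reading.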

Creating Final Model¶

In [103]:
HCmodel=AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
Out[103]:
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
In [104]:
#adding hierarchical cluster labels to the original and scaled dataframes

subset_scaled_df["HC_Clusters"]=HCmodel.labels_
df["HC_Clusters"]=HCmodel.labels_

Cluster profiling¶

In [105]:
cluster_profile=df.groupby("HC_Clusters").mean()
In [108]:
cluster_profile["count_in_each_segments"]=(df.groupby("HC_Clusters")["Net_Income"].count().values)
In [109]:
cluster_profile.style.highlight_max(color="lightgreen",axis=0)
Out[109]:
  Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio K_means_segments count_in_each_segments
HC_Clusters                          
0 48.006208 -11.263107 2.590247 196.551724 40.275862 -495901724.137931 -3597244655.172414 -8.689655 486319827.294483 75.110924 -2.162622 1.620690 29
1 326.198218 10.563242 1.642560 14.400000 309.466667 288850666.666667 864498533.333333 7.785333 544900261.301333 113.095334 19.142151 0.000000 15
2 42.848182 6.270446 1.123547 22.727273 71.454545 558636363.636364 14631272727.272728 3.410000 4242572567.290909 15.242169 -4.924615 2.000000 11
3 72.760400 5.213307 1.427078 25.603509 60.392982 79951512.280702 1538594322.807018 3.655351 446472132.228456 24.722670 -2.647194 3.701754 285
In [112]:
plt.figure(figsize=(15,10))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(4,3,i+1)
    sns.boxplot(data=df, x="HC_Clusters", y=variable)
    
plt.tight_layout(pad=2.0)
  • Cluster 0

    • cash_ratio has the lowest median, while volatility has the highest median
    • this cluster has a low current_price, and the earnings_per_share is negative
    • the roe and net_income are also low, making these stocks unfavorable
  • Cluster 1

    • p/e ratio has the lowest median, while price_change has the highest median
    • this cluster's price_change is high, while the earnings_per_share and roe are low
    • the net_cash_flow is also close to 0, making this cluster unfavorable
  • Cluster 2

    • roe has the lowest median, while estimated_shares_outstanding has the highest median
    • this cluster's volatility is low, but with earnings_per_share also being low, the stocks might not yield much in return
    • cash_ratio is also low, meaning these companies might not have enough funds to pay off their debts, which would lower their earnings further
  • Cluster 3

    • volatility has the highest median, while p/e ratios have the lowest median
    • the roe and earnings_per_share are both low
    • the estimated_shares_outstanding is also low, indicating relatively few shares in circulation

K-means vs Hierarchical Clustering¶

  • Which clustering technique took less time for execution?

    • K-means took much less time to execute
  • Which clustering technique gave you more distinct clusters, or are they the same?

    • Hierarchical clustering gave us more distinct clusters
  • How many observations are there in the similar clusters of both algorithms?

    • 13 observations from K-means and 11 from hierarchical clustering
  • How many clusters are obtained as the appropriate number of clusters from both algorithms?

    • 5 clusters for K-means and 4 clusters for hierarchical clustering
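
One way to count overlapping observations between the two solutions is to cross-tabulate the two label columns. A minimal sketch with hypothetical label arrays (in the notebook these would be df["K_means_segments"] and df["HC_Clusters"]):

```python
import pandas as pd

# hypothetical cluster labels for 8 observations; in the notebook these
# would be df["K_means_segments"] and df["HC_Clusters"]
km = pd.Series([0, 0, 1, 1, 2, 2, 2, 3])
hc = pd.Series([0, 0, 1, 1, 1, 2, 2, 3])

# rows: K-means clusters, columns: hierarchical clusters;
# each cell counts observations assigned to that pair of clusters
overlap = pd.crosstab(km, hc)
print(overlap)
```

Large off-diagonal-free rows/columns in this table indicate clusters that the two algorithms essentially agree on; the counts in those matched cells give the number of shared observations.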

Actionable Insights and Recommendations¶

-